Blog post for DRA updates in 1.36 #54567

Open
mortent wants to merge 4 commits into kubernetes:main from mortent:DRABlog136

Member

@mortent mortent commented Feb 20, 2026

Description

This is a PR for the blog post covering DRA updates for 1.36. We plan a single blog post covering all DRA updates rather than individual blog posts for each feature.

Issue

@k8s-ci-robot k8s-ci-robot added this to the 1.36 milestone Feb 20, 2026
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 20, 2026

netlify bot commented Feb 20, 2026

Pull request preview available for checking

Built without sensitive environment variables

Name Link
🔨 Latest commit a1ea462
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-io-main-staging/deploys/69d998c32d35160007ed2ef4
😎 Deploy Preview https://deploy-preview-54567--kubernetes-io-main-staging.netlify.app

Member

lmktfy commented Feb 21, 2026

/area blog

@k8s-ci-robot k8s-ci-robot added the area/blog Issues or PRs related to the Kubernetes Blog subproject label Feb 21, 2026
Member

lmktfy commented Feb 21, 2026

This PR should target main (all PRs that add blog articles should target main)

Member

nmn3m commented Feb 25, 2026

/cc @nmn3m

Contributor

harche commented Mar 9, 2026

Hi @mortent, we're planning to fold our Resource Health Status feature (KEP-4680) into this umbrella blog post instead of maintaining a separate one (#54534).

KEP-4680 is reaching Beta in v1.36. It exposes device health information from Device Plugin and DRA in Pod Status. Let us know if you'd like us to contribute a section or provide any input for the post.

@k8s-ci-robot k8s-ci-robot added area/localization General issues or PRs related to localization language/en Issues or PRs related to English language language/ja Issues or PRs related to Japanese language language/ko Issues or PRs related to Korean language language/pl Issues or PRs related to Polish language size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. language/zh Issues or PRs related to Chinese language sig/docs Categorizes an issue or PR as relevant to SIG Docs. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Mar 17, 2026
@mortent mortent changed the base branch from dev-1.36 to main March 17, 2026 22:26
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Mar 17, 2026
Member Author

mortent commented Mar 17, 2026

/wg device-management

more optimal scheduling decisions. To support this capability, the ResourceSlice
controller toolkit now automatically generates names that reflect the exact device
ordering specified by the driver author.

Contributor

@everpeace everpeace Mar 26, 2026

I want to include kubernetes/enhancements#5491 if it's worth putting in the feature blog.

ref: docs PR is #54561

Suggested change
**List Types for Attributes**
With
[List Types for Attributes](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#list-type-attributes),
DRA can represent device attributes as typed lists (int, bool, string, and
version), not just scalar values. This helps model real hardware topology, such
as devices that belong to multiple PCIe roots or NUMA domains.
This feature also extends `ResourceClaim` constraint behavior to work naturally
with both scalar and list values: `matchAttribute` now checks for a non-empty
intersection, and `distinctAttribute` checks for pairwise disjoint values.
It also introduces an `includes()` function in DRA CEL, which lets device selectors keep working
more easily when an attribute changes between scalar and list representations.
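
The constraint semantics described in this suggestion can be sketched in a few lines of Python. This is only an illustration of the set logic (non-empty intersection for `matchAttribute`, pairwise disjoint values for `distinctAttribute`), not the scheduler's implementation. The helper names `as_set`, `match_attribute`, and `distinct_attribute` are hypothetical, and treating a scalar as a one-element list is an assumption made for the sketch.

```python
from itertools import combinations


def as_set(value):
    """Normalize an attribute value: a scalar becomes a one-element set."""
    if isinstance(value, (list, tuple, set, frozenset)):
        return set(value)
    return {value}


def match_attribute(values):
    """True when every device shares at least one common attribute value."""
    sets = [as_set(v) for v in values]
    if not sets:
        return True  # assumption: no devices means nothing to violate
    return bool(set.intersection(*sets))


def distinct_attribute(values):
    """True when no two devices have any attribute value in common."""
    sets = [as_set(v) for v in values]
    return all(a.isdisjoint(b) for a, b in combinations(sets, 2))
```

For example, a device reporting a single NUMA domain as a scalar and a device reporting two domains as a list satisfy `match_attribute` as long as one domain appears in both.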

Member Author

Sorry I forgot this one, it is definitely worth including. Added your suggestion.

Member Author

Similar to my comment on #54567 (comment), do you think we could make it a bit more focused on just the benefits of the feature and leave some of the details to the DRA documentation? And see if we can keep it to a single paragraph?

Member Author

@everpeace Could you take a look at updating the description to align a bit more with the other features, ref my previous comment?

Member

It's in now, right?

devices or FPGAs—are fully prepared. By explicitly modeling resource readiness, this
prevents premature assignments that can lead to Pod failures, ensuring a much more robust
and predictable deployment process.

Contributor

I want to include kubernetes/enhancements#4680 in the feature blog.

ref: docs PR is #54420

Suggested change
**Resource Health Status (Beta)**
Knowing when a device has failed or become unhealthy is critical for
workloads running on specialized hardware. With
[Resource Health Status](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-health-monitoring),
Kubernetes now exposes device health information directly in the Pod
Status through the `allocatedResourcesStatus` field. When a DRA driver
detects that an allocated device has become unhealthy, it reports this
back to the kubelet, which surfaces it in each container's status.
In 1.36, the feature graduates to beta (enabled by default) and adds
an optional `message` field providing human readable context about the
health status, such as error details or failure reasons. DRA drivers
can also configure per device health check timeouts, allowing different
hardware types to use appropriate timeout values based on their
health reporting characteristics. This gives users and controllers
crucial visibility to quickly identify and react to hardware failures.
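
As a rough illustration of how a controller or user might consume this information, the following Python sketch walks a Pod object (for example, the parsed output of `kubectl get pod -o json`) and collects devices that are not reported as healthy. The nested field names `resources`, `resourceID`, and `health`, the `"Healthy"` value, and the placement of `message` on each device entry are assumptions about the API shape, and `unhealthy_devices` is a hypothetical helper; check the linked documentation for the authoritative schema.

```python
def unhealthy_devices(pod):
    """Return (container, resource, device_id, health, message) tuples for
    every allocated device whose reported health is not "Healthy"."""
    findings = []
    for container in pod.get("status", {}).get("containerStatuses", []):
        for resource in container.get("allocatedResourcesStatus", []):
            # Assumed shape: each entry lists devices under "resources".
            for device in resource.get("resources", []):
                health = device.get("health", "Unknown")
                if health != "Healthy":
                    findings.append((
                        container.get("name"),
                        resource.get("name"),
                        device.get("resourceID"),
                        health,
                        device.get("message", ""),  # optional field
                    ))
    return findings
```

The function only reads a plain dictionary, so it works the same whether the Pod came from a client library or from saved JSON.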

Member Author

So I've added your proposal for now, but do you think we can shorten it a bit and make it just one paragraph? There is a large number of features and we don't want the blog post to be too long. Focus just on the benefits of this feature and what it enables and leave the details to the DRA docs which we link to. Also, including that it is graduating to beta in 1.36 is already given from the context.

Member Author

Sorry I forgot to add this in the first draft, it is of course something we should include in the blog.

Member Author

@harche Could you take a look at this? Currently the description of this feature gets into quite a bit more detail than the other descriptions and I think some of it can be left to the documentation.

Contributor

Thanks @mortent, does this look better? #54567 (comment)

Member

lmktfy commented Mar 31, 2026

/remove-area localization
/remove-language ja
/remove-language ko
/remove-language pl
/remove-language zh

@k8s-ci-robot k8s-ci-robot removed area/localization General issues or PRs related to localization language/ja Issues or PRs related to Japanese language language/ko Issues or PRs related to Korean language language/pl Issues or PRs related to Polish language language/zh Issues or PRs related to Chinese language labels Mar 31, 2026
@pohly pohly moved this from 🏗 In progress to 👀 In review in Dynamic Resource Allocation Apr 7, 2026
Contributor

@pohly pohly left a comment

Overall looks good to me, thanks for putting this together.

[NVIDIA GPU](https://github.com/NVIDIA/k8s-dra-driver-gpu)
and Google TPU DRA drivers are being transferred to the Kubernetes project, joining the
[DRANET](https://github.com/kubernetes-sigs/dranet)
driver that was added last year.
Contributor

Calling those out seems reasonable for a blog post because this is newsworthy.

We could link to https://github.com/kubernetes-sigs/wg-device-management/tree/main/device-ecosystem but I'll defer to SIG Docs about that.

Member

You can link to https://www.kubernetes.dev/community/community-groups/wg/device-management/

If we want to publish something like https://github.com/kubernetes-sigs/wg-device-management/tree/main/device-ecosystem for end users, we can. Maybe a (separate) blog article?

It's not a good fit for the official Kubernetes documentation; it has a bit too much about vendors and offerings. But a neutral blog article could be OK.

Why should DRA only be for external accelerators? In v1.36, we are introducing the first
iterations of using the DRA API to manage Kubernetes native resources (like CPU and
memory). By bringing CPU and memory allocation under the DRA umbrella with the DRA
[Native Resources](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#node-allocatable-resources)
Contributor

Should this be called "Node Allocatable Resources" instead of "Native Resources"?

cc @pravk03

Member Author

I think I used the name from the KEP, but I see that #54598 uses the term node allocatable resources. @pravk03 I assume node allocatable resources is the preferred name here? And do we spell that with a hyphen or not (the docs seem inconsistent on this)?

Contributor

Yes, we have renamed this to node allocatable resources.

Following the convention in the Node Allocatable and Resource Management documentation, we can omit the hyphen. I’ll update the KEP and docs to keep it consistent.

Comment on lines +48 to +50
This allows for a gradual transition to DRA, meaning application developers and
operators are not forced to immediately migrate their workloads to the ResourceClaim
API.
Member

Does "operators" here mean DevOps? I am wondering if we need to highlight that this is only a consumption API. All the management and monitoring must be done differently with DRA.

Member Author

Section has been updated.

[Extended Resource](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#extended-resource)
feature allows users to request resources via traditional extended resources on a Pod.
This allows for a gradual transition to DRA, meaning application developers and
operators are not forced to immediately migrate their workloads to the ResourceClaim
Member

A slight rephrase of "not forced to" to something like "continue using a familiar API while exploring all the benefits of a new API" would be better.

Member Author

I've rewritten this section a bit. Let me know if you think it looks better now.

[Partitionable Devices](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#partitionable-devices)
feature provides native DRA support for carving physical hardware into smaller,
logical instances (such as Multi-Instance GPUs). This allows administrators to
safely and efficiently share expensive accelerators across multiple Pods.
Member

I think the key here is that it is dynamic, while safe and efficient. So the partitioning can dynamically change based on workload demands

Member Author

Agree. I've reworked the section a little bit.

controller toolkit now automatically generates names that reflect the exact device
ordering specified by the driver author.

## What’s next?
Member

Should we say that the big priority is to migrate the community to DRA? And also make it a call to action?

Member Author

Good point. I've added a small section for this.

health reporting characteristics. This gives users and controllers crucial visibility
to quickly identify and react to hardware failures.

## New Features
Contributor

kubernetes/enhancements#5304

This enhancement was added in 1.36. I wonder if this section should contain a sub-section for it.

cc @pohly

Contributor

Let's add it. Can you suggest something?

Contributor

Comment on lines +79 to +94
**Resource Health Status (Beta)**

Knowing when a device has failed or become unhealthy is critical for workloads running on
specialized hardware. With
[Resource Health Status](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-health-monitoring),
Kubernetes now exposes device health information directly in the Pod Status through the
`allocatedResourcesStatus` field. When a DRA driver detects that an allocated device
has become unhealthy, it reports this back to the kubelet, which surfaces it in each
container's status.

In 1.36, the feature graduates to beta (enabled by default) and adds an optional `message`
field providing human readable context about the health status, such as error details or
failure reasons. DRA drivers can also configure per device health check timeouts,
allowing different hardware types to use appropriate timeout values based on their
health reporting characteristics. This gives users and controllers crucial visibility
to quickly identify and react to hardware failures.
Contributor

Suggested change
**Resource Health Status (Beta)**
Knowing when a device has failed or become unhealthy is critical for workloads running on
specialized hardware. With
[Resource Health Status](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-health-monitoring),
Kubernetes now exposes device health information directly in the Pod Status through the
`allocatedResourcesStatus` field. When a DRA driver detects that an allocated device
has become unhealthy, it reports this back to the kubelet, which surfaces it in each
container's status.
In 1.36, the feature graduates to beta (enabled by default) and adds an optional `message`
field providing human readable context about the health status, such as error details or
failure reasons. DRA drivers can also configure per device health check timeouts,
allowing different hardware types to use appropriate timeout values based on their
health reporting characteristics. This gives users and controllers crucial visibility
to quickly identify and react to hardware failures.
**Resource Health Status (Beta)**
Knowing when a device has failed or become unhealthy is critical for workloads running on
specialized hardware. With
[Resource Health Status](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-health-monitoring),
Kubernetes now exposes device health information directly in Pod Status, giving users and
controllers crucial visibility to quickly identify and react to hardware failures. In 1.36,
the feature graduates to beta (enabled by default) and adds support for human readable
health status messages, making it easier to diagnose issues without diving into driver logs.

Member

I recommend writing the graduation as if it was in the past (which it will be). Also, watch out for implying that beta → enabled by default. Some things go to beta as initially opt-in.

controller toolkit now automatically generates names that reflect the exact device
ordering specified by the driver author.

## What’s next?
Contributor

@alaypatel07 alaypatel07 Apr 14, 2026

Suggested change
## What’s next?
**Discoverable Device Metadata in Containers**
Workloads running with DRA devices often need to discover details about
their allocated devices, such as PCI bus addresses or network
interface configuration, without querying the Kubernetes API. With
[DRA Device Metadata](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-metadata),
Kubernetes defines a standard protocol for how DRA drivers expose device
attributes to containers as versioned JSON files at well-known paths. Drivers
built with the
[DRA kubelet plugin library](https://pkg.go.dev/k8s.io/dynamic-resource-allocation/kubeletplugin)
get this behavior transparently; they just provide the metadata and the
library handles file layout, CDI bind-mounts, versioning, and lifecycle. This
gives applications a consistent, driver-independent way to discover and
consume device metadata, eliminating the need for custom controllers or
looking up ResourceSlice objects to get metadata via attributes.
## What’s next?

Member

@lmktfy lmktfy left a comment

I have some more feedback. The main thing I recommend is switching to real headings (not just a bold paragraph) about each feature we're covering.

The community has been hard at work stabilizing core DRA concepts. In Kubernetes 1.36,
several highly anticipated features have graduated to Beta and Stable.

**Prioritized List (Stable)**
Member

I would use an actual heading here; also, use sentence case, eg

### Prioritized list (stable) {#prioritized-list}

and similarly for other headings

**Prioritized List (Stable)**

Hardware heterogeneity is a reality in most clusters. With the
[Prioritized List](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#prioritized-list)
Member

Idiomatically, don't use title case for the names of features. Outside of a hyperlink we use italics; within a hyperlink they are optional.

has become unhealthy, it reports this back to the kubelet, which surfaces it in each
container's status.

In 1.36, the feature graduates to beta (enabled by default) and adds an optional `message`
Member

(for this article, only a nit)

Remember that this is a post release blog:

Suggested change
In 1.36, the feature graduates to beta (enabled by default) and adds an optional `message`
In 1.36, the feature has graduated to beta (and is now enabled by default). There was
a small change from alpha, adding an optional `message`


**ResourceClaim Support for Workloads**

To optimize large-scale AI/ML workloads that rely on strict topological scheduling, the
[ResourceClaim Support for Workloads](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#workload-resourceclaims)
Member

Try to describe what is possible (rather than summarizing the enhancement).

eg

People using Kubernetes in their platform may have large AI/ML workloads
and rely on strict _topological scheduling_ (matching Pods to run across multiple nodes
with firm constraints, such as making an entire rack of compute available
along with interconnects).
DRA provides an [integration](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#workload-resourceclaims)
with the Workload API, so that you can get near seamless management of
infrastructure resources, even across very large sets of Pods.

By associating ResourceClaims or ResourceClaimTemplates with the PodGroup API
(also new in Kubernetes v1.35),
this integration eliminates previous scaling bottlenecks, such as the limit on the
number of Pods that can share a claim, and
removes the burden of custom or manual claim
management from specialized orchestrators.

Notice how I didn't mention the feature other than by hyperlinking to its docs. I would consider also mentioning the feature gate on this one.

The DRA team
---

Dynamic Resource Allocation (DRA) has fundamentally changed how we handle hardware
Member

Suggested change
Dynamic Resource Allocation (DRA) has fundamentally changed how we handle hardware
Dynamic Resource Allocation (DRA) has fundamentally changed how platform administrators can handle hardware

If "we" means "all the contributors to DRA" then it's best not to have it also mean "anyone using Kubernetes", within one article.

resources like memory and CPU, and support for ResourceClaims in PodGroups.

We have also seen significant momentum in driver availability. Both the
[NVIDIA GPU](https://github.com/NVIDIA/k8s-dra-driver-gpu)
Member

This link is stale, and the transfer happened before the v1.36 release. Let's update and switch to past tense.


**Device Binding Conditions (Beta)**

To improve scheduling reliability, the Kubernetes scheduler can now use the
Member

Watch out for "now"; was this not possible in v1.35?

If it was, even behind a feature gate, we should reword.

Knowing when a device has failed or become unhealthy is critical for workloads running on
specialized hardware. With
[Resource Health Status](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-health-monitoring),
Kubernetes now exposes device health information directly in the Pod Status through the
Member

Suggested change
Kubernetes now exposes device health information directly in the Pod Status through the
Kubernetes can expose device health information directly in the Pod `.status`, through an entry within the

?

specialized hardware. With
[Resource Health Status](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-health-monitoring),
Kubernetes now exposes device health information directly in the Pod Status through the
`allocatedResourcesStatus` field. When a DRA driver detects that an allocated device
Member

Suggested change
`allocatedResourcesStatus` field. When a DRA driver detects that an allocated device
`.status.allocatedResourcesStatus` field.
When a compatible DRA driver detects that an allocated device

Labels

area/blog Issues or PRs related to the Kubernetes Blog subproject cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. language/en Issues or PRs related to English language sig/docs Categorizes an issue or PR as relevant to SIG Docs. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.

Projects

Status: 👀 In review
